To CaaS or Not to CaaS?

Compare the favorable functionalities of CaaS among Google, AWS, and Azure.

How to decide?#

Should we use managed Containers as a Service? That might be the most crucial question we should try to answer. Unfortunately, it’s hard to provide a universal answer since the solutions differ significantly from one provider to another. Currently (July 2020), CaaS can be described as the Wild West, with solutions ranging from amazing to useless.

Before we attempt to answer this big question, let’s go through some of the things we learned by exploring Google Cloud Run, AWS ECS with Fargate, and Azure Container Instances.


AWS, Google, or Azure?#

We can compare those three from different angles.

Simplicity#

Ease of use is one of the most essential benefits of serverless computing. It’s supposed to allow engineers to provide code or binaries (in one form or another) with a reasonable expectation that the platform of choice will do most of the rest of the work.

From a simplicity perspective, both Google Cloud Run and Azure Container Instances are exceptional. They allow us to deploy our container images with almost no initial setup:

  • Google needs only a project.
  • Azure requires only a resource group.
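As an illustration of how little setup those two require, deployment can be as simple as one command per provider. The project, resource group, registry, image, and region names below are placeholders, and the flags reflect the CLIs as of mid-2020:

```shell
# Google Cloud Run: deploy a container image into an existing project
# (my-app, gcr.io/my-project/my-app, and the region are hypothetical).
gcloud run deploy my-app \
    --image gcr.io/my-project/my-app \
    --platform managed \
    --region us-east1 \
    --allow-unauthenticated

# Azure Container Instances: create a resource group, then run the image
# (my-rg, the location, and the image reference are hypothetical).
az group create --name my-rg --location eastus
az container create \
    --resource-group my-rg \
    --name my-app \
    --image myregistry.azurecr.io/my-app:1.0 \
    --ports 80 \
    --ip-address Public
```

In both cases, the platform takes the image from there; there is no cluster, networking, or IAM setup for us to do first.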

On the other hand, AWS needs over twenty different bits and pieces (resources) to be assembled before we can even start thinking about deploying something to ECS. Even after all the infrastructure is set up, we need to create a task definition, a service, and a container definition. If simplicity is what we’re looking for, ECS is not it.

It’s horrifyingly complicated, and it’s far from the “give us a container image, we’ll take care of the rest” approach we're all looking for when switching to serverless deployments. Surprisingly, a company that provides such an amazing Functions as a Service solution (Lambda) didn't do something similar with ECS. If AWS took the same approach with ECS as with Lambda, it would likely be the winner.
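To get a feel for the difference, here is a rough sketch of the ECS path, assuming the VPC, subnets, security groups, IAM roles, and load balancer already exist (all names and IDs are hypothetical):

```shell
# ECS with Fargate: even the "short" version involves several resources.
aws ecs create-cluster --cluster-name my-cluster

# The task definition (a JSON file with container definitions, CPU,
# memory, networking mode, and an execution role) must be written first.
aws ecs register-task-definition \
    --cli-input-json file://task-definition.json

# Only then can we create a service that runs the task.
aws ecs create-service \
    --cluster my-cluster \
    --service-name my-app \
    --task-definition my-app \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"
```

And that is after all the networking and IAM prerequisites are in place.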

From the simplicity of setup and deployment perspective, Azure and Google are clear winners.

Infrastructure#

Now that we’ve mentioned infrastructure in the context of the initial setup, we might want to take that as a criterion as well:

  • There’s no infrastructure for us to manage when using CaaS in Google Cloud or Azure. They take care of all the details.
  • AWS, on the other hand, forces us to create a full-blown cluster. That alone can disqualify AWS ECS with Fargate from being considered a serverless solution. We’re not even sure whether we could qualify it as Containers as a Service.

Note: As a matter of fact, we prefer using Elastic Kubernetes Service (EKS). It’s just as easy, if not easier, than ECS, it adheres to widely accepted standards, and it doesn’t lock us into a suboptimal proprietary solution.


Scalability#

Question

How about scalability? Do our applications scale when deployed into managed Containers as a Service solution?


The answer to that question changes the story.

Google Cloud Run is scalable by design. It’s based on Knative, which is a Kubernetes resource designed for serverless workloads. It scales without us specifying anything. Unless we override the default behavior, it will create a replica of our application for every hundred concurrent requests. If there are no requests, no replicas will run. If traffic jumps to three hundred concurrent requests, it will scale to three replicas. It will queue requests if none of the replicas can handle them, and scale up and down to accommodate fluctuations in traffic. All that happens without us providing any specific information. It has sane defaults while still providing the ability to fine-tune the behavior to match our particular needs.
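Those defaults can be tuned through the same deploy command. The flags below existed in the Cloud Run CLI as of mid-2020; the service, image, and region names are placeholders:

```shell
# Create a new replica for every 50 concurrent requests instead of the
# default 100, and cap the number of replicas at 10.
gcloud run deploy my-app \
    --image gcr.io/my-project/my-app \
    --platform managed \
    --region us-east1 \
    --concurrency 50 \
    --max-instances 10
```

If no flags are given, the defaults (one replica per hundred concurrent requests, scaling down to zero) apply.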

Applications deployed to ECS are scalable as well. But it’s not easy.


Scaling applications deployed to ECS is complicated and limiting. Even if we can overlook those issues, it doesn’t scale to zero replicas. At least one replica of our application needs to run at all times because there is no built-in mechanism to queue requests and spin up new replicas. From that perspective, scaling applications in ECS is not what we’d expect from serverless computing. It’s similar to what we would get from HorizontalPodAutoscaler in Kubernetes. It can go up and down, but never to zero replicas. Given that there’s a scaling mechanism of sorts, but it cannot go down to zero replicas and it’s limited in what it can actually do, we’d say that ECS only partially fulfills the scalability needs of our applications, at least in the context of serverless computing.
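As a sketch of what “not easy” means, scaling an ECS service goes through a separate service (Application Auto Scaling), and, notably, the minimum capacity stays at one or more replicas. The cluster, service, and policy names are hypothetical:

```shell
# Register the ECS service as a scalable target (min capacity is 1,
# since ECS cannot scale a service to zero and back on demand).
aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/my-cluster/my-app \
    --min-capacity 1 \
    --max-capacity 10

# Attach a target-tracking policy (the JSON file defines the metric,
# e.g., average CPU utilization, and the target value).
aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/my-cluster/my-app \
    --policy-name my-app-cpu \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration file://policy.json
```

Compare that with Cloud Run, where equivalent behavior is the default.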

Unlike Google Cloud Run and ECS, Azure Container Instances doesn’t use a scheduler. There’s no scaling of any kind. All we can do is run single-replica containers isolated from each other. That alone means that Azure Container Instances cannot be used in production for anything but small businesses. Even in those cases, it’s still not a good idea to use ACI for production workloads. The only applicable use case might be for situations in which the application cannot scale. If we have one of those old, often stateful applications that can run only in single-replica mode, we might consider Azure Container Instances. For anything else, the inability to scale is a showstopper.

Simply put, Azure Container Instances provide a way to run Docker containers in the cloud. There’s not much more to it, and we know that Docker alone is not enough for anything but development purposes.

We would say that even development with Docker alone is not a good idea.

Lock in#

Another potentially important criterion is the level of lock-in. ECS (with or without Fargate) is fully proprietary and forces us to rely entirely on AWS. The amount of resources we need to create and the format for writing application definitions ensures that we’re locked into AWS.

If we choose to use it, we won’t be able to move anywhere else, at least not easily. That doesn’t necessarily mean the benefits don’t outweigh the potential cost of being locked in, but we do need to be aware of it when deciding whether to use it.

Azure Container Instances are also fully proprietary but, given that all the investment is in creating container images and running a single command, we aren’t locked in. The investment is very low, so if we choose to switch to another solution or a different provider, we should be able to do that with relative ease.

The issue with ECS is not lock-in itself. There’s nothing wrong with using proprietary solutions that solve problems in a better way than open alternatives. The problem is that ECS is not any better than Kubernetes. So, the problem with being locked into ECS is that we’re locked into a service that isn’t as good as the more open counterpart provided by the same company (AWS EKS). That doesn’t mean that EKS is the best managed Kubernetes service (it is not), but that, within the AWS ecosystem, it’s probably a better choice.

Google Cloud Run is based on Knative, which is open source and open standard. Google only provides a layer on top of it. We can even deploy using Knative definitions, which work in any Kubernetes cluster with Knative installed. As a result, it’s easy to move to a different service if we need to.
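Because the format is open, the same definition works against Cloud Run and against any Kubernetes cluster with Knative Serving installed. A minimal sketch, in which the name, image, and concurrency target are hypothetical:

```shell
# Apply a standard Knative Service manifest to a cluster with Knative.
cat <<EOF | kubectl apply --filename -
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        # Scale one replica per 100 concurrent requests.
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: gcr.io/my-project/my-app
EOF
```

Moving away from Cloud Run is then mostly a matter of pointing `kubectl` at a different cluster.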

High availability#

Google Cloud Run was the only solution that didn't produce 100% availability in our tests with Siege. So far, that's the first negative point we could give it. That doesn't mean it's not highly available, but rather that it tends to produce only a few nines after the decimal (e.g., 99.99%). That’s not a bad result by any means.

If we did more serious testing, we would see that over a more extended period and with a higher number of requests, the other solutions would also drop below 100% availability. Nevertheless, with a smaller sample, Azure Container Instances and AWS ECS did produce better results than Google Cloud Run, and that's not something we should ignore.

Azure Container Instances, on the other hand, can only handle limited traffic. The inability to scale horizontally inevitably leads to a failure to be highly available. We didn’t experience that with our Siege tests, mostly because a single replica was able to handle a thousand concurrent requests. If we increased the load beyond what one replica can handle, it would start collapsing.

ECS provides the highest availability, as long as we set up horizontal scaling. We do need to work for it, though.

The most important question to answer is whether any of those services are production-ready. We already saw that Azure Container Instances should not be used in production, except for very specific use cases.

Google Cloud Run and AWS ECS, on the other hand, are production-ready. Both provide all the features we might need when running production workloads. The significant difference is that ECS has existed for much longer, while Google Cloud Run is a relatively new service, at least at the time of this writing (July 2020). Nevertheless, it’s based on Google Kubernetes Engine (GKE), which is considered the most mature and stable managed Kubernetes we can use today.

Given that Google Cloud Run is only a layer on top of GKE, we can safely assume that it’s stable enough. The bigger potential problem is in Knative itself. It’s a relatively new project that hasn’t yet reached its first GA release (at the time of this writing, the latest release is 0.16.0). Nevertheless, major software vendors are behind it. Even though it might not yet be battle-tested, it’s getting very close to being the preferable way to run serverless computing in Kubernetes.

